1. hypertext pages (e.g. WWW pages) and hyperlinks (e.g. HREFs): One potential 
affinity measure is to let: 

$$a_1(u, v) = \begin{cases} 1 & \text{if the link } u \to v \text{ exists} \\ 0 & \text{otherwise.} \end{cases}$$

Another possibility is: 

$$a_2(u, v) = \begin{cases} 1 & \text{if the link } v \to u \text{ exists} \\ 0 & \text{otherwise.} \end{cases}$$

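In matrix terms, the latter affinity matrix is the transpose of the former, i.e. $A_2 = A_1^T$. In 
this case the associated similarity matrices might be, for example: 

$$M_1 = A_1 A_1^T$$
$$M_2 = A_2 A_2^T$$
$$M_3 = \alpha A_1 A_1^T + (1 - \alpha) A_2 A_2^T, \quad \text{where } \alpha \in [0, 1].$$

In component form: 

$$m_1(u, v) = \sum_w a_1(u, w)\, a_1(v, w)$$
$$m_2(u, v) = \sum_w a_2(u, w)\, a_2(v, w)$$
$$m_3(u, v) = \alpha\, m_1(u, v) + (1 - \alpha)\, m_2(u, v) = \alpha \sum_w a_1(u, w)\, a_1(v, w) + (1 - \alpha) \sum_w a_2(u, w)\, a_2(v, w), \quad \text{where } \alpha \in [0, 1].$$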

Note that: 

(a) $m_1(u, v)$ is a measure of the number of web pages "pointed to" by both u and v. 

(b) $m_2(u, v)$ is a measure of the number of web pages that "point to" both u and v. 

(c) $m_3(u, v)$ is a weighted combination of $m_1$ and $m_2$. 

Many more affinity measures and similarity definitions are possible. 
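
The component-form definitions above can be illustrated with a minimal sketch. The four-page link graph and the names `links`, `a1`, `a2`, `m1`, `m2` and `m3` are assumptions made up for illustration; they do not come from the text.

```python
# Hypothetical link graph: links[u] lists the pages that page u points to.
links = {
    "p0": ["p2", "p3"],
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": [],
}
pages = sorted(links)


def a1(u, v):
    """Affinity a_1: 1 if the link u -> v exists, else 0."""
    return 1 if v in links[u] else 0


def a2(u, v):
    """Affinity a_2: 1 if the link v -> u exists, else 0 (transpose of a_1)."""
    return a1(v, u)


def m1(u, v):
    """Number of pages pointed to by both u and v."""
    return sum(a1(u, w) * a1(v, w) for w in pages)


def m2(u, v):
    """Number of pages that point to both u and v."""
    return sum(a2(u, w) * a2(v, w) for w in pages)


def m3(u, v, alpha):
    """Weighted combination of m1 and m2, with alpha in [0, 1]."""
    return alpha * m1(u, v) + (1 - alpha) * m2(u, v)


print(m1("p0", "p1"), m2("p2", "p3"), m3("p0", "p1", 0.5))
```

In this toy graph, p0 and p1 share two outgoing targets, so they score highly under $m_1$, while p2 and p3 share two incoming sources, so they score highly under $m_2$.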

2. documents and terms: A potential affinity measure is the following: 

$$a(u, v) = \begin{cases} 1 & \text{if document } u \text{ contains term } v \\ 0 & \text{otherwise.} \end{cases}$$

Other possible affinity measures are: 

• a(u, v) = the number of times term v occurs in document u. 

• a(u, v) = the relative frequency of term v in document u. This is equal to the number 
of occurrences of term v divided by the total number of term occurrences (all terms) in 
document u. 

• a(u, v) = the so-called TF/IDF measure of term v in document u, which is the 
frequency of term v in document u divided by the average frequency of term v in the 
entire collection. 
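
The affinity variants listed above can be sketched from a table of raw counts. This is only one reading of the text: the count table is hypothetical, "frequency" is taken to mean relative frequency, and the "average frequency in the entire collection" is taken to be the mean of the per-document relative frequencies. The function names are assumptions, and the last measure follows the ratio described here rather than the logarithmic form of TF/IDF often used elsewhere.

```python
# Hypothetical raw counts: counts[u][v] = occurrences of term v in document u.
counts = [
    [4, 0, 4],
    [1, 1, 2],
]


def binary(counts):
    """a(u, v) = 1 if document u contains term v, else 0."""
    return [[1 if c > 0 else 0 for c in row] for row in counts]


def relative_frequency(counts):
    """a(u, v) = count of term v in u / total term occurrences in u."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


def tf_over_average(counts):
    """a(u, v) = relative frequency of v in u / mean relative frequency of v
    over all documents (assumed reading of the text's TF/IDF measure)."""
    rel = relative_frequency(counts)
    n_docs, n_terms = len(rel), len(rel[0])
    avg = [sum(rel[u][v] for u in range(n_docs)) / n_docs for v in range(n_terms)]
    return [
        [rel[u][v] / avg[v] if avg[v] else 0.0 for v in range(n_terms)]
        for u in range(n_docs)
    ]


print(binary(counts))
print(relative_frequency(counts))
print(tf_over_average(counts))
```

Under this reading, a term that is spread evenly across the collection gets a weight near 1 in every document, while a term concentrated in one document gets a weight above 1 there.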

Consider the following three-document example: 

• Documents: 

(a) the King James Bible 

(b) the novel Jaws 

(c) The Joy of Cooking 

• Terms: 

(a) thou 

(b) shark 

(c) flour 

(d) water 

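In this case the document-term affinity matrix A = {a(u, v)} (rows: documents (a) to (c); 
columns: terms (a) to (d)) might look like the following: 

$$A = \begin{bmatrix} 6000 & 10 & 100 & 200 \\ 0 & 3215 & 40 & 3060 \\ 0 & 133 & 3321 & 2856 \end{bmatrix}$$

If we define the document-document similarity matrix to be $M = AA^T$, then 

$$M = \begin{bmatrix} 36050100 & 648150 & 904630 \\ 648150 & 19701425 & 9299795 \\ 904630 & 9299795 & 19203466 \end{bmatrix}$$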

We see from this that, by this measure, Jaws and The Joy of Cooking are more similar to 
each other than to The King James Bible because of the frequencies of "water" and to a lesser 
extent the term "shark". 
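
As a check on the arithmetic, the following minimal sketch recomputes the document-document similarity matrix from the affinity matrix given above and breaks the Jaws / The Joy of Cooking entry down by term. The helper name `gram_rows` and the variable names are assumptions introduced for illustration.

```python
terms = ["thou", "shark", "flour", "water"]

# Document-term affinity matrix from the example
# (rows: King James Bible, Jaws, The Joy of Cooking).
A = [
    [6000, 10, 100, 200],
    [0, 3215, 40, 3060],
    [0, 133, 3321, 2856],
]


def gram_rows(A):
    """Return A A^T: entry (u, v) is the dot product of rows u and v of A."""
    return [[sum(x * y for x, y in zip(ru, rv)) for rv in A] for ru in A]


M = gram_rows(A)

# Per-term contributions to the Jaws / Joy of Cooking similarity M[1][2].
contrib = {t: A[1][k] * A[2][k] for k, t in enumerate(terms)}

for row in M:
    print(row)
print(contrib)
```

The per-term breakdown shows "water" contributing by far the largest share of that entry, followed by "shark" and then "flour".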

3. collaborative filtering example - movie rating: In this case we have two sets of entities: 
movies and viewers. The affinity a(u, v) will be a number between 0 and 20 indicating the 
degree to which viewer u liked or (if less than 10) disliked movie v. Assume viewers Sam, 
Bill, Ellen, Fred and Mary and the following movies: 

(a) Star Wars 

(b) Die Hard 

(c) My Dinner With Andre 



(d) The Rocky Horror Show 

(e) Blade Runner 

(f) The Remains of the Day 

(g) Taxi Driver 

(h) Dumb and Dumber 

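We might, for example, have the following affinity matrix (rows: viewers in the order 
listed above; columns: movies (a) to (h)): 

$$A = \begin{bmatrix} 20 & 20 & 0 & 7 & 17 & 2 & 16 & 10 \\ 18 & 17 & 2 & 10 & 16 & 3 & 19 & 11 \\ 14 & 2 & 17 & 9 & 10 & 19 & 10 & 0 \\ 17 & 19 & 0 & 10 & 17 & 0 & 18 & 20 \\ 18 & 10 & 16 & 14 & 14 & 19 & 12 & 0 \end{bmatrix}$$

from which a movie-movie similarity matrix can be formed using $M = A^T A$. 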
Or, using $M = AA^T$ gives a viewer-viewer similarity matrix: 

$$M = \begin{bmatrix} 1498 & 1462 & 751 & 1567 & 1126 \\ 1462 & 1464 & 817 & 1563 & 1175 \\ 751 & 817 & 1131 & 716 & 1291 \\ 1567 & 1563 & 716 & 1763 & 1090 \\ 1126 & 1175 & 1291 & 1090 & 1577 \end{bmatrix}$$
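
Both similarity matrices can be derived from the same ratings matrix, as the following minimal sketch shows. The helper names `transpose` and `matmul` and the variable names are assumptions introduced for illustration; the ratings are those of the affinity matrix in this example.

```python
viewers = ["Sam", "Bill", "Ellen", "Fred", "Mary"]
movies = [
    "Star Wars", "Die Hard", "My Dinner With Andre", "The Rocky Horror Show",
    "Blade Runner", "The Remains of the Day", "Taxi Driver", "Dumb and Dumber",
]

# Viewer-movie affinity matrix (rows: viewers, columns: movies).
A = [
    [20, 20, 0, 7, 17, 2, 16, 10],
    [18, 17, 2, 10, 16, 3, 19, 11],
    [14, 2, 17, 9, 10, 19, 10, 0],
    [17, 19, 0, 10, 17, 0, 18, 20],
    [18, 10, 16, 14, 14, 19, 12, 0],
]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    """Return the matrix product X Y for matrices stored as lists of rows."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


viewer_sim = matmul(A, transpose(A))  # 5 x 5, M = A A^T
movie_sim = matmul(transpose(A), A)   # 8 x 8, M = A^T A

for name, row in zip(viewers, viewer_sim):
    print(name, row)
```

The same `A` yields a 5 x 5 matrix when rows (viewers) are compared and an 8 x 8 matrix when columns (movies) are compared, so the choice between $AA^T$ and $A^T A$ decides which set of entities is being clustered.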

Table 1 shows the principal and first non-principal affinity components of the movie-movie 
similarity matrix. The first gives a kind of popularity rating, whereas the second shows a 
clustering of the movies. In this case the affinity components are the eigenvectors of the 
similarity matrix. The concept of the eigenvectors of a matrix is well known in the theory of 
linear algebra. They can be computed using any of a number of iterative algorithms like those 
covered by this invention. The eigenvector concept and a number of iterative algorithms for 
computing eigenvectors are described in the book Matrix Computations by G. Golub and 
Charles Van Loan, published in 1989 by the Johns Hopkins University Press 
(ISBN 0-8018-3739-1). Three clusters are evident. At one extreme are Die Hard and Dumb and 
Dumber, while at the other are The Remains of the Day and My Dinner With Andre. The others 
lie in a somewhat more diffuse central cluster, though Taxi Driver and Blade Runner are fairly 
tightly grouped. The first non-principal affinity component of the viewer-viewer similarity 
matrix is shown in Table 2, clearly indicating two clusters: men and women in this case. 
Table 1: Principal and first non-principal affinity components (eigenvectors in this case) for movies 



Viewer    Component
Sam       -0.2839
Bill      -0.2147
Ellen      0.6238
Fred      -0.4291
Mary       0.5478

Table 2: First non-principal affinity (eigenvector) components for viewers. 
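
The viewer components in Table 2 can be reproduced with a short sketch. It uses textbook power iteration with projection against the principal eigenvector; this is a swapped-in standard technique chosen for brevity, not necessarily the iterative algorithm of the invention. The function names, the starting vectors and the iteration count are assumptions.

```python
import math

# Viewer-viewer similarity matrix M = A A^T from the movie rating example.
M = [
    [1498, 1462, 751, 1567, 1126],
    [1462, 1464, 817, 1563, 1175],
    [751, 817, 1131, 716, 1291],
    [1567, 1563, 716, 1763, 1090],
    [1126, 1175, 1291, 1090, 1577],
]


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def normalize(x):
    """Scale x to unit Euclidean length."""
    n = math.sqrt(sum(xi * xi for xi in x))
    return [xi / n for xi in x]


def power_iteration(M, start, orthogonal_to=(), iters=500):
    """Dominant eigenpair of symmetric M within the subspace orthogonal to
    the given unit vectors (empty by default)."""
    x = normalize(start)
    for _ in range(iters):
        x = matvec(M, x)
        for q in orthogonal_to:  # project out already-found eigenvectors
            dot = sum(a * b for a, b in zip(x, q))
            x = [a - dot * b for a, b in zip(x, q)]
        x = normalize(x)
    eigenvalue = sum(a * b for a, b in zip(x, matvec(M, x)))  # Rayleigh quotient
    return eigenvalue, x


lam1, v1 = power_iteration(M, [1.0] * 5)
lam2, v2 = power_iteration(M, [1.0, 2.0, 3.0, 4.0, 5.0], orthogonal_to=[v1])
if v2[2] < 0:  # eigenvector sign is arbitrary; fix it so Ellen is positive
    v2 = [-c for c in v2]

for name, c in zip(["Sam", "Bill", "Ellen", "Fred", "Mary"], v2):
    print(f"{name:6s}{c:8.4f}")
```

The sign of an eigenvector is arbitrary, so only the split between positive and negative components is meaningful; here Sam, Bill and Fred fall on one side and Ellen and Mary on the other.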



